{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Simple Directory Reader"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `SimpleDirectoryReader` is the most commonly used data connector that _just works_.  \n",
    "Simply pass in a input directory or a list of files.  \n",
    "It will select the best file reader based on the file extensions.  "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get Started"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import SimpleDirectoryReader"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load specific files "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "reader = SimpleDirectoryReader(\n",
    "    input_files=[\"../data/paul_graham/paul_graham_essay.txt\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded 1 docs\n"
     ]
    }
   ],
   "source": [
    "docs = reader.load_data()\n",
    "print(f\"Loaded {len(docs)} docs\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load all (top-level) files from directory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "reader = SimpleDirectoryReader(input_dir=\"../../end_to_end_tutorials/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded 72 docs\n"
     ]
    }
   ],
   "source": [
    "docs = reader.load_data()\n",
    "print(f\"Loaded {len(docs)} docs\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load all (recursive) files from directory "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# only load markdown files\n",
    "required_exts = [\".md\"]\n",
    "\n",
    "reader = SimpleDirectoryReader(\n",
    "    input_dir=\"../../end_to_end_tutorials\", required_exts=required_exts, recursive=True\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded 174 docs\n"
     ]
    }
   ],
   "source": [
    "docs = reader.load_data()\n",
    "print(f\"Loaded {len(docs)} docs\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Full Configuration"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the full list of arguments that can be passed to the `SimpleDirectoryReader`:\n",
    "\n",
    "```python\n",
    "class SimpleDirectoryReader(BaseReader):\n",
    "    \"\"\"Simple directory reader.\n",
    "\n",
    "    Load files from file directory. \n",
    "    Automatically select the best file reader given file extensions.\n",
    "\n",
    "\n",
    "    Args:\n",
    "        input_dir (str): Path to the directory.\n",
    "        input_files (List): List of file paths to read\n",
    "            (Optional; overrides input_dir, exclude)\n",
    "        exclude (List): glob of python file paths to exclude (Optional)\n",
    "        exclude_hidden (bool): Whether to exclude hidden files (dotfiles).\n",
    "        encoding (str): Encoding of the files.\n",
    "            Default is utf-8.\n",
    "        errors (str): how encoding and decoding errors are to be handled,\n",
    "                see https://docs.python.org/3/library/functions.html#open\n",
    "        recursive (bool): Whether to recursively search in subdirectories.\n",
    "            False by default.\n",
    "        filename_as_id (bool): Whether to use the filename as the document id.\n",
    "            False by default.\n",
    "        required_exts (Optional[List[str]]): List of required extensions.\n",
    "            Default is None.\n",
    "        file_extractor (Optional[Dict[str, BaseReader]]): A mapping of file\n",
    "            extension to a BaseReader class that specifies how to convert that file\n",
    "            to text. If not specified, use default from DEFAULT_FILE_READER_CLS.\n",
    "        num_files_limit (Optional[int]): Maximum number of files to read.\n",
    "            Default is None.\n",
    "        file_metadata (Optional[Callable[str, Dict]]): A function that takes\n",
    "            in a filename and returns a Dict of metadata for the Document.\n",
    "            Default is None.\n",
    "\"\"\"\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
